Python Code Sample

Eric LaRose

February 21, 2021

This file contains all the relevant Python code for my final project in my Data Science III class, which replicates Wallace, Sharfstein, and Kaminsky (2019). Their paper uses k-means clustering to divide counties into 8 clusters based on their sociodemographic characteristics. My analysis attempts to recreate their findings and also considers additional clustering methods and robustness of results to the exclusion of outliers.

This Jupyter Notebook is divided into three separate sections. The first imports the raw data and cleans it. The second does basic exploratory data analysis. The third and final section conducts the analyses using three clustering techniques: k-means, agglomerative hierarchical clustering, and density-based scanning.

Part 1: Import and Clean Data

In this section, we import and clean raw data from the following sources:

  1. The 2014-2018 5-year American Community Survey (ACS) from the Census Bureau provides reliable estimates for all counties on various socioeconomic and demographic characteristics. We obtain the following county-level variables from this source (also used in Wallace, Sharfstein, and Kaminsky 2019): the percent of adults aged 25-44 with at least some college education, the percent of the population under 65 without health insurance, the percent of the population aged 16+ that is in the labor force but unemployed, the median age, the percent of the adult population that is married, the percent of the population that is female, and four variables corresponding to the percent of the population that is non-Hispanic white, non-Hispanic black, non-Hispanic American Indian or Alaska Native, and Hispanic. All of these variables are used in the authors' k-means clustering. We also gather a variable on total population, which is used in exploratory data analysis but is not used in the clustering.
  2. The percent of the county population that is rural, from the 2010 Census provided by the Census Bureau. This is also used in k-means clustering.
  3. From the County Health Rankings dataset provided by the Robert Wood Johnson Foundation, we also import data on the following three health outcomes for each county: the percent of adults who are current smokers, the motor vehicle death rate per 100,000 population, and the percent of adults who are obese (defined as a Body Mass Index of at least 30). Note that these three variables are not used in the clustering but are rather used after clusters are created to rank counties within each cluster. We use the 2016 rankings as this aligns with the midpoint year of the 2014-2018 ACS.

After importing and cleaning these datasets, we combine them to form a completed dataset with each observation corresponding to a county.

1. American Community Survey

We use the censusdata package to import and clean this data.

2. Census Bureau

This data comes from the 2010 Census and is downloaded from the National Historical Geographic Information System (NHGIS); we read it in below.

3. County Health Rankings (Robert Wood Johnson Foundation)

This data is downloaded from https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation/national-data-documentation-2010-2018.

Merge Data Together

The chunk of code below merges the data together and applies a few necessary transformations. We perform an inner join, so that we only keep counties that appear in all datasets.
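The inner-join logic can be sketched as below, using small hypothetical versions of the three datasets keyed on a county FIPS code (the column name `fips` and the toy values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical mini-versions of the three county-level datasets,
# each keyed on a 5-digit county FIPS code.
acs = pd.DataFrame({"fips": ["01001", "01003", "01005"],
                    "pct_college": [55.2, 60.1, 38.4]})
rural = pd.DataFrame({"fips": ["01001", "01003"],
                      "pct_rural": [42.0, 57.3]})
health = pd.DataFrame({"fips": ["01001", "01003", "99999"],
                       "pct_smokers": [19.5, 17.2, 25.0]})

# Chained inner joins keep only counties present in all three sources.
merged = (acs.merge(rural, on="fips", how="inner")
             .merge(health, on="fips", how="inner"))
print(len(merged))  # -> 2: only "01001" and "01003" appear in every dataset
```

Counties missing from any one source (here "01005" and "99999") drop out, which is exactly the behavior described above.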

Part 2: Exploratory Data Analysis

In this section, we perform some exploratory data analysis. First, we check for missing values in the chunk of code below.

We can see below that there is only one variable that has any missing observations, and that variable only has one missing value. We can drop this value.
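The check-and-drop pattern looks like the following sketch, on a toy frame with a single missing value (the column names are illustrative, not the actual project variables).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged county data, with one missing value.
df = pd.DataFrame({"pct_college": [55.2, np.nan, 38.4],
                   "pct_rural": [42.0, 57.3, 88.1]})

print(df.isna().sum())  # count of missing values per variable
df = df.dropna()        # drop any row with a missing value
print(len(df))          # -> 2
```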

We can see that we have 3,134 counties, compared to 3,139 in Wallace, Sharfstein, and Kaminsky (2019). Next, let's look at summary statistics of our data.

From above, we can see that many of these variables appear to be normally distributed. However, some have clear outliers. For instance, there is one county that appears to be only 21% female, roughly 12 standard deviations below the mean! Let's visualize some of these trends below and show that these outliers are mostly driven by counties with a very small population.

Additionally, let's look at the correlation of our three health variables.

We can see that smoking and obesity are fairly highly positively correlated, which is perhaps expected as both are more behavioral aspects of health. Motor vehicle deaths, by contrast, are largely uncorrelated with either, which also makes intuitive sense. Let's visualize these correlations below.
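A correlation matrix of this kind can be computed with `DataFrame.corr()`; the sketch below uses simulated data built to mimic the pattern just described (smoking and obesity correlated, motor vehicle deaths independent) — the values are not the real county data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
smoking = rng.uniform(10, 30, 100)                # simulated smoking rates
obesity = 0.8 * smoking + rng.normal(0, 3, 100)   # built to correlate with smoking
mv_deaths = rng.uniform(5, 40, 100)               # independent of both

health = pd.DataFrame({"smoking": smoking,
                       "obesity": obesity,
                       "mv_deaths": mv_deaths})
print(health.corr().round(2))  # pairwise Pearson correlations
```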

With the plots above, we can see that there are definitely outliers in the data and they are primarily counties with small populations. All of these counties are included in the original data but we know that k-means clustering can be very sensitive to outliers. Thus, in the analysis below we also see how our results change when we remove counties with fewer than 10,000 residents. This threshold is somewhat arbitrary but removes virtually all of the most extreme outliers in our data.
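The population filter itself is a one-line boolean selection; a minimal sketch with hypothetical counties:

```python
import pandas as pd

# Toy county data; names and populations are illustrative.
df = pd.DataFrame({"county": ["A", "B", "C"],
                   "total_pop": [5_000, 12_000, 250_000]})

# Keep only counties with at least 10,000 residents.
df_large = df[df["total_pop"] >= 10_000].copy()
print(len(df), len(df_large))  # -> 3 1... counties before and after the filter
```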

We can see that we have dropped approximately 700 out of over 3,100 counties.

Next, let's do some preprocessing of our data before we conduct our cluster analysis. Specifically, for both datasets (including and excluding small counties), we keep only the variables used in the clustering and standardize those variables.
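Standardization can be done with scikit-learn's `StandardScaler`, which rescales each variable to mean 0 and standard deviation 1 so that no single variable dominates the Euclidean distances used in clustering. The feature values below are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical clustering features for four counties (two variables).
X = np.array([[55.2, 42.0],
              [60.1, 57.3],
              [38.4, 88.1],
              [47.0, 12.5]])

# Each column is rescaled to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(6))  # -> [0. 0.]
print(X_std.std(axis=0).round(6))   # -> [1. 1.]
```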

Part 3: Analysis

In this section we apply five clustering methods.

  1. $k$-means clustering using all counties and 8 clusters.
  2. $k$-means clustering using all counties and determining the optimal number of clusters.
  3. $k$-means clustering removing counties with fewer than 10,000 residents and determining the optimal number of clusters.
  4. Hierarchical clustering using all counties.
  5. Density-based clustering using all counties.

We then use cluster validation methods to assess the validity of each of the five methods and determine which methods perform relatively better than others. As discussed in more detail in the final report, we only remove potential outliers from $k$-means clustering because the other clustering techniques are much less sensitive to outliers.

1. K-Means Clustering with 8 Clusters, All Counties

As a first step, we simply use k-means clustering with 8 clusters, including all counties, as done in Wallace, Sharfstein, and Kaminsky (2019). We then map these clusters and compare to the map in the paper. In addition to mapping the clusters, we also create percentiles of obesity within each cluster, and across all counties, and map the difference to compare that map to Wallace, Sharfstein, and Kaminsky (2019). This is discussed in more detail in the final report.
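The fit-then-rank step can be sketched as follows, on simulated data (300 "counties" with 11 standardized features and random obesity rates — stand-ins, not the real variables). Within-cluster and overall percentiles come from pandas' `rank(pct=True)`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))           # simulated standardized features
obesity = rng.uniform(20, 45, size=300)  # simulated obesity rates

# k-means with 8 clusters, as in the original paper.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

df = pd.DataFrame({"cluster": labels, "obesity": obesity})
# Percentile of each county's obesity rate within its cluster vs. overall,
# and the difference between the two for mapping.
df["pct_within"] = df.groupby("cluster")["obesity"].rank(pct=True)
df["pct_overall"] = df["obesity"].rank(pct=True)
df["pct_diff"] = df["pct_within"] - df["pct_overall"]
print(df["cluster"].nunique())  # -> 8
```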

The plots and statistics above are discussed in more detail in the final report, but overall they are very similar to the findings discussed in Wallace, Sharfstein, and Kaminsky (2019). However, this takes their inclusion of all counties, their clustering method, and the number of clusters they use as given. Next, let's use SSE to determine the optimal number of clusters to use in this scenario.

2. K-Means Clustering with All Counties, Determining the Optimal Number of Clusters

In the chunks of code below we again use all counties. However, we vary the number of clusters and calculate the sum of squared errors (SSE) and average silhouette score resulting from each value, using $k=2,3,...,14,15$ clusters. We then plot both of these to select the number of clusters.
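The sweep over $k$ can be sketched as below, again on simulated data: for each $k$ we record the SSE (scikit-learn's `inertia_`) and the average silhouette score, then plot both against $k$ to look for an elbow and a peak, respectively.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))  # simulated standardized county features

sse, sil = {}, {}
for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                      # sum of squared errors
    sil[k] = silhouette_score(X, km.labels_)  # average silhouette score
# Plot sse and sil against k; choose k at the elbow in SSE
# and/or the peak in silhouette score.
```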

As discussed in more detail in the final report, the two plots above lead us to choose $k=5$; thus, we find that the authors use too many clusters. The plots provide little support for choosing $k=8$, although it would not be unreasonable to choose $k=7$. Below, let's recreate the above two maps with $k=5$.

The differences between these maps using 5 clusters and the maps obtained by the authors are discussed in more detail in the final report.

3. K-Means Clustering Removing Small Counties, Determining the Optimal Number of Clusters

Here we repeat the analysis, but excluding counties with fewer than 10,000 residents. As discussed in more detail above and in the final report, $k$-means clustering is highly sensitive to outliers and thus it is worth investigating whether removing outliers substantially affects our results. The chunks of code below determine the optimal number of clusters only among counties with at least 10,000 residents.

As discussed in more detail in the final report, the two plots above again favor using $k=5$ clusters even when removing outliers. Below, let's visualize these resulting clusters and the within-cluster obesity percentiles.

From a visual inspection of the maps above, excluding outliers actually does not seem to drastically alter our clusters. We will explore this in more detail in our cluster validation.

4. Hierarchical Clustering, Using All Counties

Here, we use agglomerative hierarchical clustering with the complete link method, using all counties. First, let's visualize a dendrogram to determine the optimal number of clusters.
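With SciPy, complete-link agglomerative clustering amounts to building a linkage matrix, inspecting its dendrogram, and cutting the tree; a minimal sketch on simulated features (the data and the cut at 3 clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 11))  # simulated standardized county features

# Agglomerative clustering with the complete link method.
Z = linkage(X, method="complete")
# dendrogram(Z)  # visualize merge heights to pick the number of clusters

# Cut the tree into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(len(set(labels)))  # -> 3
```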

As discussed in more detail in the final report, this leads us to choose 3 clusters. Let's map the resulting clusters below (we skip a percentile map, as that is very uninformative in this case).

From above, we can see that the clusters produced by hierarchical clustering are unlikely to be very useful in this research context, as the vast majority of counties are in the same cluster. We discuss this in more detail in the final report.

5. Density-Based Scanning, Using All Counties

Here we apply density-based scanning to the dataset using all counties. As discussed in more detail in the final report, a main advantage of density-based scanning is that it can automatically identify and remove outliers as noise, so there is no need to "manually" drop outliers. Additionally, density-based scanning automatically determines the number of clusters.

Below, we apply density-based scanning and report the number of clusters. We will see that density-based clustering performs extremely poorly on this dataset, labeling almost all data points as noise and producing only two distinct clusters.
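The application itself follows scikit-learn's DBSCAN interface, which labels noise points as -1 and numbers clusters from 0; the data, `eps`, and `min_samples` values below are illustrative choices, not the project's actual tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))  # simulated standardized county features

# eps and min_samples are assumed values for illustration.
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

In higher-dimensional standardized data like this, small `eps` values tend to leave most points as noise, which mirrors the failure mode described above.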

The choropleth map above shows that density-based clustering clearly has serious issues on this dataset, as there are only two clusters produced and the majority of counties are removed as noise.

Cluster Validation

In this final section of code, we validate the five clustering methods we applied. That is, we try to determine which perform better than others. In particular, we consider the following validation methods:

  1. Intuition. This simply involves us applying the "sniff" test on the data, i.e. making sure that the clusters make intuitive sense. From above, we can clearly see that hierarchical clustering and DBSCAN fail the sniff test, as they produce very few clusters which are clearly relatively uninformative.
  2. Comparing silhouette scores. Silhouette scores are discussed in more detail in the final report. We calculate silhouette scores for our five sets of clusters below.
  3. Predictive power on health outcomes. Here, for each set of clusters, we create a set of dummy variables for the clusters and then, for each of the three health outcomes, run a regression of the health outcome on the set of dummy variables, including a constant, and record the $R^2$ from each regression. This method is discussed in more detail in the final report. First, we need to define a function that automates the process.
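One way to automate that regression is sketched below; the function name `cluster_r2` is a hypothetical helper, and the toy labels/outcome are constructed so the outcome differs sharply by cluster.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def cluster_r2(labels, outcome):
    """R^2 from regressing a health outcome on cluster dummy variables."""
    # One dummy per cluster, dropping the first to avoid collinearity
    # with the constant (the regression's intercept).
    X = pd.get_dummies(pd.Series(labels), drop_first=True).to_numpy(dtype=float)
    reg = LinearRegression().fit(X, outcome)
    return reg.score(X, outcome)

# Toy example: two clusters with very different outcome levels,
# plus a little within-cluster variation, so R^2 is close to 1.
labels = np.array([0] * 10 + [1] * 10)
outcome = np.concatenate([np.full(10, 20.0), np.full(10, 40.0)])
outcome = outcome + np.linspace(-1, 1, 20)
print(round(cluster_r2(labels, outcome), 3))
```

Looping this function over the five sets of cluster labels and the three health outcomes yields the full grid of $R^2$ values to compare.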